How safe is flying really?

A human-centered, time-series aviation analysis.

Lauren Heintz
DATA 512, Fall 2019
Due 12/12/19

I. Introduction

While I am studying part time to become a data scientist, my full-time job is at The Boeing Company. Boeing has recently received a great deal of news and media coverage surrounding the 737 MAX crashes, which has drawn attention to air travel and its safety.

I hope to provide scientific, fact-based, data-driven information to busy American travelers, with conclusions that any adult reader can understand and find value in. This analysis will also strive to provide a historical lens through which to view the recent events. I will focus specifically on United States air travel.

In my analysis I plan to investigate the trends in aviation accidents and fatalities over time. I will also look at tangible factors that all readers can understand and use descriptive statistics and visualizations to bring light to patterns if they exist.

II. Towards Human-Centered Data Science

I did not come across any ethical concerns in finding this aviation data or utilizing it. It is federally licensed and distributed - open for public consumption. In fact, the largest data set is from the "ASIAS" project - Aviation Safety Information Analysis and Sharing. It appears the intention behind making this data available is to encourage transparency and analysis in the aviation industry.

There are several good reasons to perform a human-centered data science investigation on the safety of air travel. The first is that, with the recent media coverage surrounding the Boeing 737 MAX crashes, there is more journalism than ever on the topic. This journalism, however, is not always fact-based or unbiased. The more media coverage there is, the more difficult it becomes to understand what is actually true. I believe it has become increasingly difficult for American readers to sort through the information in front of them and determine what is subjective vs. objective. Many Americans are tired of this and want to get straight to the facts about air travel safety. I think my analysis could provide a means to do this.

Second, air travel has become increasingly popular over the past three decades, as my analysis below will show. It is growing in the commercial sector specifically, as businesses depend on airplanes for international collaboration. On an individual level, recreational travel and tourism have become more popular as airplane ticket prices have dropped. This is a good reason to stop and reflect on where air travel safety has been, and where it is headed as demand increases in the coming decade.

Finally, perhaps the most important reason for a human-centered analysis on the topic: the laws of aerodynamics are complicated! So complicated that a graduate-level degree in an aeronautical field is required to truly understand how planes work. For a lay reader outside the aerospace industry, it is hard to bridge this gap and truly understand the issues that the industry faces. This gap in understanding of how planes work leads to a gap in trust between the customers and the products. If customers don't understand what factors make a plane safe or not, how can they make an informed decision? How will they know whether or not to trust this flying metal bird in the sky? For the future of the aerospace industry, we must make products our customers have confidence in! This is why this type of work is so important.

III. Background & Research

A Google search on flight safety turns up one article with some interesting statistics that immediately point to the overwhelming safety of air travel compared with other modes of transportation: "There are a range of estimates out there, but based on its analysis of US Census data, it puts the odds of dying as a plane passenger at 1 in 205,552. That compares with odds of 1 in 4,050 for dying as a cyclist; 1 in 1,086 for drowning, and 1 in 102 for a car crash." (SBS) The article goes on to list some accident numbers; however, it only lists accidents from the US and Canada from 2013 to 2017. After that, it looks at all countries, which my analysis does not. It appears to use US National Safety Council data, not FAA data.

Another top search result, a self-help blog for people with flying anxiety, also compares air travel to other modes of travel. It states that "In fact, based on this incredible safety record, if you did fly every day of your life, probability indicates that it would take you nineteen thousand years before you would succumb to a fatal accident. Nineteen thousand years! An additional comparison to the dangers of driving points out that a sold-out 727 jet would have to crash every day of the week, with no survivors, to equal the highway deaths per year in this country."

It points to outside sources for the following numbers:
DEATH BY: YOUR ODDS
Cardiovascular disease: 1 in 2
Smoking (by/before age 35): 1 in 600
Car trip, coast-to-coast: 1 in 14,000
Bicycle accident: 1 in 88,000
Tornado: 1 in 450,000
Train, coast-to-coast: 1 in 1,000,000
Lightning: 1 in 1.9 million
Bee sting: 1 in 5.5 million
U.S. commercial jet airline: 1 in 7 million
[Sources: Natural History Museum of Los Angeles County, Massachusetts Institute of Technology, University of California at Berkeley]

These numbers vary significantly from those in the previous article, which could be due to the date it was written. It could also simply be because these odds are hard to estimate: not only do you need an appropriate accident count, you also need to appropriately estimate how often someone is flying or driving. For this very reason, I did not seek to incorporate other modes of transportation into my analysis; the data was simply not available or accurate. In any case, the blog does not take a deep dive into accidents, pilots, or any sort of time series analysis.

Finally, probably the most relevant result I found was a Bloomberg article warning that flying has become more dangerous recently. This article touches on the same hot topic that I wish to address. While it does include one bar chart of total annual fatalities from 2010 to 2018, it does not include any other visuals. Even this visual is based on world passenger airline fatality data, not United States data. It does not appear to use the same data set as mine, and the article quickly moves on to other factors in airplane safety, such as demand for speed and cost cutting by airlines, as well as an increased burden of safety regulation.

I am excited to say that, despite the recent media craze, there do not seem to be many recent data-driven articles or publications on the safety of air travel that appeal to the general public. This means that my research could provide some real benefit to the general public. It does not appear that my work builds directly on anyone's previous work. Since FAA data is widely available, I would not be surprised if someone had already done this analysis; however, it was not published and popularized anywhere that I found.

In an attempt to simplify aviation jargon and confusing regulatory terminology, I made the table below to explain some of the terms that may be used throughout my project.

Throughout the rest of this analysis, the terms 'private' and 'public'/'commercial' will be used as shorthand for some of the distinctions in the definitions below.

Term Definition
NTSB National Transportation Safety Board
14 CFR-121 Specification of Operating Requirements: Domestic, Flag, and Supplemental Operations, more than 10 passengers, scheduled service
Domestic Operation Departure and Arrival Airport both within US
Flag Operation Departure from US Airport, Arrival in non-US Airport
Supplemental Operation Departure and Arrival from US Airports, cargo or large charter
Accident An accident in these data sets means that an illegal act such as suicide, sabotage, or terrorism was NOT responsible for the occurrence. For example, fatalities from the September 11, 2001 terrorist attacks are excluded but are available in another data set.
Accident, Major Major - an accident in which any of three conditions is met: 1) a Part 121 aircraft was destroyed, or 2) there were multiple fatalities, or 3) there was one fatality and a Part 121 aircraft was substantially damaged.
Accident, Serious Serious - an accident in which at least one of two conditions is met: 1) there was one fatality without substantial damage to a Part 121 aircraft, 2) there was at least one serious injury and a Part 121 aircraft was substantially damaged.
Accident, Injury Injury - a nonfatal accident with at least one serious injury and without substantial damage to a Part 121 aircraft.
Accident, Damage Damage - an accident in which no person was killed or seriously injured, but in which any aircraft was substantially damaged.

IV. Data Definition

Source File Name Table Name Years Data Source
1 faaAccidentIncidentDataSystem.csv FAA Accident and Incident Data System (AIDS) 1978 - 2015 https://www.asias.faa.gov/apex/f?p=100:11:::NO:::
2 accidentsAccidentRates_scheduledPass.csv Accidents and Accident Rates by NTSB Classification, 1995 through 2014, for U.S. Air Carriers Operating Under 14 CFR 121 1983 - 2014 https://catalog.data.gov/dataset/accidents-and-accident-rates-by-ntsb-classification-1995-through-2014-for-u-s-air-carriers
3 accidentsFatalitiesRates_airlines.csv Accidents, Fatalities, and Rates, 1995 through 2014, for U.S. Air Carriers Operating Under 14 CFR 121, Scheduled and Nonscheduled Service (Airlines) 1983 - 2014 https://catalog.data.gov/dataset/accidents-fatalities-and-rates-1995-through-2014-for-u-s-air-carriers-operating-under-14-c-dae36
4 accidentsFatalitiesRates_genAv.csv Accidents, Fatalities, and Rates, 1995 through 2014, U.S. General Aviation 1975 - 2014 https://catalog.data.gov/dataset/accidents-fatalities-and-rates-1995-through-2014-u-s-general-aviation

There are multiple data sets that I plan to use in tandem to complete my analysis. The most detailed data set is the FAA Accident & Incident Data System, which contains a detailed record for every single accident and incident from ~1978 to ~2015 (with MM/DD/YYYY available for each) for any type of United States flight (private, commercial, scheduled, unscheduled, cargo, passenger). Examples of the relevant fields available for each record are: Local Event Date, Event City, Event State, Event Airport, Event Type, Aircraft Damage, Flight Phase, Aircraft Make, Aircraft Model, Aircraft Series, Operator, Primary Flight Type, Total Fatalities, Total Injuries, PIC Certificate Type, PIC Flight Time Total Hrs, PIC Flight Time Total Make-Model.

The other data sets are at an aggregate level and may be more useful for those less familiar with aviation terminology. The second and third data sets are for scheduled, passenger flights, so generally commercial airlines. The second source contains more detail on the severity of accidents and the third source contains more detail on the number of fatalities. They have several columns of redundant data, though, so I will likely join these two data sets, eliminate the redundant columns, and keep all unique columns so that I have the maximum granularity from the two. The second data set contains the following fields: Year, Accidents: Major, Accidents: Serious, Accidents: Injury, Accidents: Damage, Aircraft hours flown (millions), Accidents per Million Hours Flown: Major, Accidents per Million Hours Flown: Serious, Accidents per Million Hours Flown: Injury, Accidents per Million Hours Flown: Damage.

The third data set contains the following fields: Year, Accidents: All, Accidents: Fatal, Fatalities: Total, Fatalities: Aboard, Flight Hours, Miles Flown, Departures, Accidents per 100,000 Flight Hours: All, Accidents per 100,000 Flight Hours: Fatal, Accidents per 1,000,000 Miles Flown: All, Accidents per 1,000,000 Miles Flown: Fatal, Accidents per 100,000 Departures: All, Accidents per 100,000 Departures: Fatal.

The fourth data set is general aviation, so it includes private and personal flights, not just scheduled commercial passenger flights. Therefore it includes more types of planes, smaller planes, and more airports. That information is not detailed in this data set, but as a whole it represents far more flight hours and far more total accidents. It contains the fields: Year, Accidents: All, Accidents: Fatal, Fatalities: Total, Fatalities: Aboard, Flight Hours, Accidents per 100,000 Flight Hour: All, Accidents per 100,000 Flight Hour: Fatal.

V. Research Questions & Methodology

For each research question proposed, I will now explain my methodology for this analysis and presentation to the reader.

  1. How has the safety of commercial* air travel in the United States changed over the past ~30 years?
  • 1 a. How has the number of accidents/fatalities in the United States changed over the past ~30 years?

For this, I will use data sources 2 & 3 to plot the aggregate totals as either a line or bar chart. Source 2 will be for accidents and source 3 will be for fatalities. I will make one plot for fatalities and one plot for accidents. For accidents, I will either plot multiple lines in different colors to denote the different types of accident totals, or use a stacked bar chart. For fatalities I will also use a line or bar chart.

  • 1 b. How has the ratio of accidents/fatalities per annual miles flown in the United States changed over the past ~30 years?

For these, I will again use tables 2 and 3 for accidents and fatalities respectively. Both of these tables contain rate information. I may address this question in several plots. First, I will plot the gross miles flown per year as a line or bar chart. This will highlight the large growth in air travel over the past 30 years. After this, I will plot accidents and fatalities as a rate per miles flown each year. This will show that, relative to the increase in air travel over the years, the likelihood of being in a plane that gets into an accident has gone down dramatically.

  • 1 c. How has the ratio of accidents/fatalities per annual flight hours in the United States changed over the past ~30 years?

For these, I will again use tables 2 and 3 for accidents and fatalities respectively. Both of these tables contain rate information. I may address this question in several plots. First, I will plot the gross annual flight hours as a line or bar chart. This will highlight the large growth in air travel over the past 30 years. After this, I will plot accidents and fatalities as a rate per annual flight hours. This will show that, relative to the increase in air travel over the years, the likelihood of being in a plane that gets into an accident has gone down dramatically.

  • 1 d. How do the rates of accidents/fatalities vary between commercial* and all public and private flights?

After addressing parts 1a-1c for just commercial* flights, I will turn to table 4, which contains the same accident, fatality, and rate data but for general aviation in the United States. I will use this to compare the relative safety of commercial flight vs. general aviation (including private and personal flights). I will choose just one metric, accidents per flight hour, and plot both the commercial and general aviation lines on the same plot in different colors.
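As a minimal sketch of how the two rate series could be lined up by year before plotting, the snippet below uses two tiny stand-in frames holding three years of values copied from the tables in this notebook. The helper name `combine_rates` is hypothetical. Note that the two source files spell the rate column slightly differently ('Flight Hours' vs. 'Flight Hour').

```python
import pandas as pd

# Stand-in data: three years copied from the tables in this notebook.
commercial = pd.DataFrame({
    'Year': [1983, 1984, 1985],
    'Accidents per 100,000 Flight Hours, All': [0.315, 0.196, 0.241],
})
general = pd.DataFrame({
    'Year': [1983, 1984, 1985],
    'Accidents per 100,000 Flight Hour, All': [10.67, 10.28, 9.63],
})

def combine_rates(commercial_df, general_df):
    """Hypothetical helper: one row per year, one column per flight category."""
    left = commercial_df.set_index('Year')['Accidents per 100,000 Flight Hours, All']
    right = general_df.set_index('Year')['Accidents per 100,000 Flight Hour, All']
    # concat aligns on the Year index; a year missing from one source becomes NaN
    return pd.concat({'Commercial': left, 'General aviation': right}, axis=1)

rates = combine_rates(commercial, general)
print(rates)
# rates.plot() would then draw both lines on one set of axes.
```

Because the stand-in values for the two categories differ by more than a factor of ten, a log-scaled y-axis may make both lines easier to read on a shared plot.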

  2. What kinds of planes are responsible for the most accidents?

Finally, in looking at source 1, my most detailed data source, I can see more granular information for each accident. This contains the type of airplane for every data entry. To present to the reader what planes are responsible for the most crashes, I will create a pivot table that sums the number of accidents for each plane. Then, I will create a table for the reader that displays just the top 5 planes. Since I do not expect the reader to have a knowledge of all types of planes, I will either include an image or brief description of the plane.
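The pivot described above could be sketched roughly as follows. The records are toy placeholders, not real accident data, and the column names 'Aircraft Make' and 'Aircraft Model' are taken from the field list earlier in this report; the exact headers in the raw file are an assumption. The helper `top_aircraft` is hypothetical.

```python
import pandas as pd

# Toy placeholder records standing in for rows of the detailed FAA table.
toy = pd.DataFrame({
    'Aircraft Make':  ['A', 'A', 'B', 'A', 'B', 'C'],
    'Aircraft Model': ['1', '1', '2', '3', '2', '4'],
})

def top_aircraft(df, n=5):
    """Hypothetical helper: count events per make/model and keep the n largest."""
    counts = (df.groupby(['Aircraft Make', 'Aircraft Model'])
                .size()
                .rename('Accident Count')
                .reset_index())
    # Sort by count, then by name, so ties come out in a stable order
    counts = counts.sort_values(['Accident Count', 'Aircraft Make', 'Aircraft Model'],
                                ascending=[False, True, True])
    return counts.head(n).reset_index(drop=True)

top = top_aircraft(toy)
print(top)
```

One caveat with raw counts: aircraft types that fly the most will naturally accumulate more events, so a top-5 list by count alone says little about per-flight risk.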

  3. How does the experience level of the pilots impact airplane crashes?

A question a reader may ask when boarding a plane is: how experienced is my pilot? Data source 1, the detailed table, also contains a column with the number of hours of experience of the Pilot in Command (PIC) for most accidents. I will extract this column and create a histogram so that the reader can see the distribution of experience hours among pilots who have been in crashes, and judge for themselves whether the number of hours could be a factor in safety.
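A rough sketch of the histogram step, assuming the pilot-hours column arrives as text with occasional blanks or non-numeric entries. The values and the bin edges below are invented for illustration and are not taken from the FAA data.

```python
import numpy as np
import pandas as pd

# Toy values standing in for the pilot-hours column of a raw export.
hours_raw = pd.Series(['150', '2,300', '', '480', '12000', 'unknown', '95'])

# Strip thousands separators, coerce anything non-numeric to NaN, then drop it
hours = pd.to_numeric(hours_raw.str.replace(',', '', regex=False),
                      errors='coerce').dropna()

# Fixed bin edges in flight hours (an illustrative choice)
bin_edges = [0, 100, 500, 1000, 5000, 20000]
counts, edges = np.histogram(hours, bins=bin_edges)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f'{int(lo):>6} to {int(hi):>6} hours: {int(c)} pilot(s)')
```

A histogram like this describes only pilots who appear in the accident records; without the distribution of hours across all pilots, it cannot by itself show whether experience changes risk.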

VI. Data Processing

In [127]:
import pandas as pd
import os
import numpy as np
import copy
import matplotlib.pyplot as plt
from IPython.display import Image
In [141]:
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/data_raw
/Users/laurenheintz/Docs/MSDS/Fa2019/data512/FinalProject/data_raw

Import and Process Data Sources 2 and 3

Data sources 2 and 3 both pertain to commercial airline data. Since there is quite a bit of overlap between the information in sources 2 and 3, the first data processing task will be to join the two datasets.

Our final cleaned, combined file of airline data will be saved as data_clean/airline_aggregate.csv

In [3]:
# Read in accidents data and accidents+fatalities data

accidents_df = pd.read_csv('accidentsAccidentRates_scheduledPass.csv', sep=',', header=0) # source 2
accidents_fatal_df = pd.read_csv('accidentsFatalitiesRates_airlines.csv', sep=',', header=0) # source 3
In [4]:
# Look at columns in data set 2

accidents_df.head(1)
list(accidents_df.columns)
Out[4]:
['Year',
 'Accidents, Major',
 'Accidents, Serious',
 'Accidents, Injury',
 'Accidents, Damage',
 'Aircraft Hours Flown (millions)',
 'Accidents per Million Hours Flown, Major',
 'Accidents per Million Hours Flown, Serious',
 'Accidents per Million Hours Flown, Injury',
 'Accidents per Million Hours Flown, Damage']
In [5]:
# Look at columns in data set 3

list(accidents_fatal_df.columns)
Out[5]:
['Year',
 'Illegal Act',
 'Accidents, All',
 'Accidents, Fatal',
 'Fatalities, Total',
 'Fatalities, Aboard',
 'Flight Hours',
 'Miles Flown',
 'Departures',
 'Accidents per 100,000 Flight Hours, All',
 'Accidents per 100,000 Flight Hours, Fatal',
 'Accidents per 1,000,000 Miles Flown, All',
 'Accidents per 1,000,000 Miles Flown, Fatal',
 'Accidents per 100,000 Departures, All',
 'Accidents per 100,000 Departures, Fatal']
In [6]:
# Merge data sets 2 and 3

# 'Year' is the only column the two data sets share, so join on it explicitly
accidents_all_df = pd.merge(accidents_df, accidents_fatal_df, on='Year', how='inner')
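Before relying on an inner join, it can help to check which key values it would discard. The sketch below uses toy frames with deliberately offset year ranges (the real sources are described as covering the same years) and counts copied from the tables in this notebook.

```python
import pandas as pd

# Toy stand-ins for sources 2 and 3; year ranges are offset on purpose.
left = pd.DataFrame({'Year': [1983, 1984, 1985], 'Accidents, Major': [4, 2, 8]})
right = pd.DataFrame({'Year': [1984, 1985, 1986], 'Accidents, All': [16, 21, 24]})

# Which column names do the two frames have in common?
shared = [c for c in left.columns if c in right.columns]
print(shared)

# An outer join with indicator=True shows which years an inner join would drop;
# validate='one_to_one' raises if a year appears twice in either frame.
check = pd.merge(left, right, on='Year', how='outer',
                 validate='one_to_one', indicator=True)
dropped = check.loc[check['_merge'] != 'both', 'Year'].tolist()
print(dropped)

inner = pd.merge(left, right, on='Year', how='inner', validate='one_to_one')
print(inner)
```

If `dropped` comes back empty on the real data, the inner join loses nothing.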

For better understanding and readability, we will drop one redundant column and reorder the remaining columns so that similar data is grouped together.

In [7]:
# Drop a redundant column

accidents_all_df = accidents_all_df.drop(columns=['Aircraft Hours Flown (millions)'])
In [8]:
# Reorder columns

accidents_all_df_reordered = accidents_all_df[['Year',
                                               'Illegal Act',
                                               'Flight Hours',
                                               'Miles Flown',
                                               'Departures',
                                               'Accidents, Major',
                                               'Accidents, Serious',
                                               'Accidents, Injury',
                                               'Accidents, Damage',
                                               'Accidents, All',
                                               'Accidents, Fatal',
                                               'Fatalities, Total',
                                               'Fatalities, Aboard',
                                               'Accidents per Million Hours Flown, Major',
                                               'Accidents per Million Hours Flown, Serious',
                                               'Accidents per Million Hours Flown, Injury',
                                               'Accidents per Million Hours Flown, Damage',
                                               'Accidents per 100,000 Flight Hours, All',
                                               'Accidents per 100,000 Flight Hours, Fatal',
                                               'Accidents per 1,000,000 Miles Flown, All',
                                               'Accidents per 1,000,000 Miles Flown, Fatal',
                                               'Accidents per 100,000 Departures, All',
                                               'Accidents per 100,000 Departures, Fatal']]
In [9]:
accidents_all_df_reordered.head(10)
Out[9]:
Year Illegal Act Flight Hours Miles Flown Departures Accidents, Major Accidents, Serious Accidents, Injury Accidents, Damage Accidents, All ... Accidents per Million Hours Flown, Major Accidents per Million Hours Flown, Serious Accidents per Million Hours Flown, Injury Accidents per Million Hours Flown, Damage Accidents per 100,000 Flight Hours, All Accidents per 100,000 Flight Hours, Fatal Accidents per 1,000,000 Miles Flown, All Accidents per 1,000,000 Miles Flown, Fatal Accidents per 100,000 Departures, All Accidents per 100,000 Departures, Fatal
0 1983 No 7,298,799 3,069,318,000 5,444,374 4 2 9 8 23 ... 0.548 0.274 1.233 1.096 0.315 0.055 0.0075 0.0013 0.422 0.073
1 1984 No 8,165,124 3,428,063,000 5,898,852 2 2 6 6 16 ... 0.245 0.245 0.735 0.735 0.196 0.012 0.0047 0.0003 0.271 0.017
2 1985 No 8,709,894 3,631,017,000 6,306,759 8 2 5 6 21 ... 0.918 0.230 0.574 0.689 0.241 0.08 0.0058 0.0019 0.333 0.111
3 1986 Yes 9,976,104 4,017,626,000 7,202,027 4 0 14 6 24 ... 0.401 0.000 1.403 0.601 0.231 0.02 0.0057 0.0005 0.319 0.028
4 1987 Yes 10,645,192 4,360,521,000 7,601,373 5 1 12 16 34 ... 0.470 0.094 1.127 1.503 0.310 0.038 0.0076 0.0009 0.434 0.053
5 1988 Yes 11,140,548 4,503,426,000 7,716,061 4 2 13 11 30 ... 0.359 0.180 1.167 0.987 0.260 0.018 0.0064 0.0004 0.376 0.026
6 1989 No 11,274,543 4,605,083,000 7,645,494 8 4 6 10 28 ... 0.710 0.355 0.532 0.887 0.248 0.098 0.0061 0.0024 0.366 0.144
7 1990 No 12,150,116 4,947,832,000 8,092,306 4 3 10 7 24 ... 0.329 0.247 0.823 0.576 0.198 0.049 0.0049 0.0012 0.297 0.074
8 1991 No 11,780,610 4,824,824,000 7,814,875 5 2 10 9 26 ... 0.424 0.170 0.849 0.764 0.221 0.034 0.0054 0.0008 0.333 0.051
9 1992 No 12,359,715 5,039,435,000 7,880,707 3 3 10 2 18 ... 0.243 0.243 0.809 0.162 0.146 0.032 0.0036 0.0008 0.228 0.051

10 rows × 23 columns
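As a quick sanity check on how the rate columns relate to the raw counts, the arithmetic for the 1983 row can be reproduced by hand. The figures are copied from the table above; the dictionary keys are just illustrative labels.

```python
# Values copied from the 1983 row of the merged table above.
row_1983 = {
    'major': 4, 'serious': 2, 'injury': 9, 'damage': 8,
    'accidents_all': 23,
    'flight_hours': 7_298_799,
    'miles_flown': 3_069_318_000,
    'departures': 5_444_374,
}

# The four severity classes should add up to the overall accident count
severity_total = sum(row_1983[k] for k in ('major', 'serious', 'injury', 'damage'))

# Rate = accidents / exposure, scaled to the unit named in the column header
per_100k_hours = row_1983['accidents_all'] / row_1983['flight_hours'] * 100_000
per_million_miles = row_1983['accidents_all'] / row_1983['miles_flown'] * 1_000_000
per_100k_departures = row_1983['accidents_all'] / row_1983['departures'] * 100_000
major_per_million_hours = row_1983['major'] / row_1983['flight_hours'] * 1_000_000

print(severity_total)
print(round(per_100k_hours, 3), round(per_million_miles, 4),
      round(per_100k_departures, 3), round(major_per_million_hours, 3))
```

The rounded results agree with the table's 1983 entries of 0.315, 0.0075, 0.422, and 0.548, and the severity classes sum to the 23 accidents listed under 'Accidents, All'.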

In this next step of data cleaning, we check the data types of all the columns in the data set and change them where needed to suit our computation.

In [10]:
accidents_all_df_reordered.dtypes
Out[10]:
Year                                            int64
Illegal Act                                    object
Flight Hours                                   object
Miles Flown                                    object
Departures                                     object
Accidents, Major                                int64
Accidents, Serious                              int64
Accidents, Injury                               int64
Accidents, Damage                               int64
Accidents, All                                  int64
Accidents, Fatal                                int64
Fatalities, Total                               int64
Fatalities, Aboard                              int64
Accidents per Million Hours Flown, Major      float64
Accidents per Million Hours Flown, Serious    float64
Accidents per Million Hours Flown, Injury     float64
Accidents per Million Hours Flown, Damage     float64
Accidents per 100,000 Flight Hours, All       float64
Accidents per 100,000 Flight Hours, Fatal      object
Accidents per 1,000,000 Miles Flown, All      float64
Accidents per 1,000,000 Miles Flown, Fatal     object
Accidents per 100,000 Departures, All         float64
Accidents per 100,000 Departures, Fatal        object
dtype: object

Not all of the columns have the correct data type. The object columns should be changed to int64, with the exception of Illegal Act, which is a string (Yes or No), and the rate columns, which should be float64.
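An alternative worth noting is to let `pd.read_csv` do this conversion at load time through its `thousands` and `na_values` parameters. The sketch below runs on a toy two-row CSV. It assumes the raw file quotes its comma-formatted numbers and marks missing values with a bare '-', and the dash in the second row is invented to show the missing-value case.

```python
import io
import pandas as pd

# Toy CSV mimicking the raw layout: quoted numbers with commas, '-' for missing.
raw = io.StringIO(
    'Year,Flight Hours,"Accidents per 100,000 Flight Hours, Fatal"\n'
    '1983,"7,298,799",0.055\n'
    '1984,"8,165,124",-\n'
)

# thousands=',' strips the separators; na_values marks '-' as missing
df = pd.read_csv(raw, thousands=',', na_values=['-'])
print(df.dtypes)
print(df)
```

With this approach the count column is read as an integer and the rate column as a float with NaN for the dash, so no later string replacement is needed. If the real file pads the dash with spaces, the `na_values` list would need to include that exact text.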

In [11]:
# Remove commas from numbers which are currently string objects

accidents_all_df_no_comma = accidents_all_df_reordered.replace(',','', regex=True)
In [12]:
# Update types of columns

accidents_all_df_no_comma['Illegal Act'] = accidents_all_df_no_comma['Illegal Act'].astype(str)
accidents_all_df_no_comma['Flight Hours'] = accidents_all_df_no_comma['Flight Hours'].astype(int)
accidents_all_df_no_comma['Miles Flown'] = accidents_all_df_no_comma['Miles Flown'].astype(int)
accidents_all_df_no_comma['Departures'] = accidents_all_df_no_comma['Departures'].astype(int)
In [13]:
# Fill items with dashes with NaN, then convert the remaining object columns to floats

fatal_rate_cols = ['Accidents per 100,000 Flight Hours, Fatal',
                   'Accidents per 1,000,000 Miles Flown, Fatal',
                   'Accidents per 100,000 Departures, Fatal']

for col in fatal_rate_cols:
    # A single .loc[row_mask, column] assignment avoids chained indexing
    # (the chained form triggers pandas' SettingWithCopyWarning)
    dash_mask = accidents_all_df_no_comma[col].str.contains('-')
    accidents_all_df_no_comma.loc[dash_mask, col] = np.nan
    accidents_all_df_no_comma[col] = accidents_all_df_no_comma[col].astype(float)

accidents_all_clean = accidents_all_df_no_comma.copy()
accidents_all_clean.dtypes
Out[13]:
Year                                            int64
Illegal Act                                    object
Flight Hours                                    int64
Miles Flown                                     int64
Departures                                      int64
Accidents, Major                                int64
Accidents, Serious                              int64
Accidents, Injury                               int64
Accidents, Damage                               int64
Accidents, All                                  int64
Accidents, Fatal                                int64
Fatalities, Total                               int64
Fatalities, Aboard                              int64
Accidents per Million Hours Flown, Major      float64
Accidents per Million Hours Flown, Serious    float64
Accidents per Million Hours Flown, Injury     float64
Accidents per Million Hours Flown, Damage     float64
Accidents per 100,000 Flight Hours, All       float64
Accidents per 100,000 Flight Hours, Fatal     float64
Accidents per 1,000,000 Miles Flown, All      float64
Accidents per 1,000,000 Miles Flown, Fatal    float64
Accidents per 100,000 Departures, All         float64
Accidents per 100,000 Departures, Fatal       float64
dtype: object
In [14]:
# Index by year while keeping 'Year' available as a regular column
accidents_all_clean = accidents_all_clean.set_index('Year', drop=False)

Now let's save the file we have cleaned up to our clean data folder.

In [15]:
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/
/Users/laurenheintz/Docs/MSDS/Fa2019/data512/FinalProject
In [16]:
accidents_all_clean.to_csv('data_clean/airline_aggregate.csv')

Import and Process Data Source 4

Data source 4 contains accident and fatality roll-up data similar to sources 2 and 3, but for all general aviation in the United States rather than commercial flights.

Our final cleaned file of general aviation data will be saved as data_clean/genav_aggregate.csv

In [17]:
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/data_raw
/Users/laurenheintz/Docs/MSDS/Fa2019/Data
In [18]:
# Read in accidents data for all aviation in the US

genav_df = pd.read_csv('accidentsFatalitiesRates_genAv.csv', sep=',', header=0) # source 4
In [121]:
genav_df
Out[121]:
Year Accidents, All Accidents, Fatal Fatalities, Total Fatalities, Aboard Flight Hours Accidents per 100,000 Flight Hour, All Accidents per 100,000 Flight Hour, Fatal
0 1975 3995 633 1252 1231 28,799,000 13.87 2.19
1 1976 4018 658 1216 1203 30,476,000 13.17 2.16
2 1977 4079 661 1276 1265 31,578,000 12.91 2.09
3 1978 4216 719 1556 1398 34,887,000 12.08 2.06
4 1979 3818 631 1221 1203 38,641,000 9.88 1.63
5 1980 3590 618 1239 1230 36,402,000 9.86 1.69
6 1981 3500 654 1282 1261 36,803,000 9.51 1.78
7 1982 3,233 591 1187 1171 29,640,000 10.82 1.96
8 1983 3,075 555 1,068 1,061 28,673,000 10.67 1.92
9 1984 3,017 545 1,042 1,021 29,099,000 10.28 1.84
10 1985 2,739 498 956 945 28,322,000 9.63 1.74
11 1986 2,581 474 967 879 27,073,000 9.49 1.73
12 1987 2,494 446 837 822 26,972,000 9.18 1.63
13 1988 2,388 460 797 792 27,446,000 8.65 1.66
14 1989 2,242 432 769 766 27,920,000 7.97 1.52
15 1990 2,242 444 770 765 28,510,000 7.85 1.55
16 1991 2,197 439 800 786 27,678,000 7.91 1.57
17 1992 2,110 450 866 864 24,780,000 8.51 1.81
18 1993 2,064 401 744 740 22,796,000 9.03 1.74
19 1994 2,021 404 730 723 22,235,000 9.08 1.81
20 1995 2,056 412 734 727 24,906,000 8.21 1.63
21 1996 1,908 361 636 619 24,881,000 7.65 1.45
22 1997 1,840 350 631 625 25,591,000 7.17 1.36
23 1998 1,902 364 624 618 25,518,000 7.43 1.41
24 1999 1,905 340 621 615 29,246,000 6.5 1.16
25 2000 1,837 345 596 585 27,838,000 6.57 1.21
26 2001 1,727 325 562 558 25,431,000 6.78 1.27
27 2002 1,716 345 581 575 25,545,000 6.69 1.33
28 2003 1,741 352 633 630 25,998,000 6.68 1.34
29 2004 1,619 314 559 559 24,888,000 6.49 1.26
30 2005 1,671 321 563 558 23,168,000 7.2 1.38
31 2006 1,523 308 706 547 23,963,000 6.35 1.28
32 2007 1,654 288 496 491 23,819,000 6.94 1.2
33 2008 1,568 277 496 487 22,805,000 6.87 1.21
34 2009 1,480 275 479 470 20,862,000 7.08 1.32
35 2010 1,440 271 458 455 21,688,000 6.63 1.24
36 2011 1,470 269 452 441 - - -
37 2012 1,470 272 437 437 20,881,000 7.04 1.3
38 2013 1,224 222 391 386 19,492,000 6.26 1.12
39 2014 1,221 253 419 410 18,103,000 6.74 1.4
In [19]:
# Fill any dashes with a '0' placeholder so the column can be converted to a numeric type

for col in ['Flight Hours',
            'Accidents per 100,000 Flight Hour, All',
            'Accidents per 100,000 Flight Hour, Fatal']:
    # A single .loc[row_mask, column] assignment avoids chained indexing
    genav_df.loc[genav_df[col].str.contains('-'), col] = '0'
In [123]:
genav_df.dtypes
Out[123]:
Year                                         int64
Accidents, All                              object
Accidents, Fatal                             int64
Fatalities, Total                           object
Fatalities, Aboard                          object
Flight Hours                                object
Accidents per 100,000 Flight Hour, All      object
Accidents per 100,000 Flight Hour, Fatal    object
dtype: object

Not all of the columns have the correct data type. The object columns should be changed to int64, with the exception of the 'Accidents per 100,000 Flight Hour' rate columns, which should be float64.

In [20]:
# Remove commas from numbers which are currently string objects
genav_no_comma = genav_df.replace(',','', regex=True)
In [21]:
# Update some columns from objs to ints
genav_no_comma['Accidents, All'] = genav_no_comma['Accidents, All'].astype(int)
genav_no_comma['Fatalities, Total'] = genav_no_comma['Fatalities, Total'].astype(int)
genav_no_comma['Fatalities, Aboard'] = genav_no_comma['Fatalities, Aboard'].astype(int)
genav_no_comma['Flight Hours'] = genav_no_comma['Flight Hours'].astype(int)

# Update rate columns from objs to floats
genav_no_comma['Accidents per 100,000 Flight Hour, All'] = genav_no_comma['Accidents per 100,000 Flight Hour, All'].astype(float)
genav_no_comma['Accidents per 100,000 Flight Hour, Fatal'] = genav_no_comma['Accidents per 100,000 Flight Hour, Fatal'].astype(float)
In [22]:
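# Change the placeholder zeros back to np.nan, only in the columns that held dashes
dash_cols = ['Flight Hours', 'Accidents per 100,000 Flight Hour, All', 'Accidents per 100,000 Flight Hour, Fatal']
genav_no_comma[dash_cols] = genav_no_comma[dash_cols].replace(0, np.nan)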
In [23]:
genav_clean = genav_no_comma.copy() 
genav_clean.dtypes
Out[23]:
Year                                          int64
Accidents, All                                int64
Accidents, Fatal                              int64
Fatalities, Total                             int64
Fatalities, Aboard                            int64
Flight Hours                                float64
Accidents per 100,000 Flight Hour, All      float64
Accidents per 100,000 Flight Hour, Fatal    float64
dtype: object
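
The cleaning steps above (fill dashes, strip commas, cast, restore missing values) could also be sketched in a single pass with pd.to_numeric. This is only an illustration, not the code used in this analysis. The three rows are copied from the general aviation table above, and the helper name to_number is an assumption made for this sketch.

```python
import numpy as np
import pandas as pd

# Three rows copied from the general aviation table (2010-2012)
raw = pd.DataFrame({
    'Year': [2010, 2011, 2012],
    'Accidents, All': ['1,440', '1,470', '1,470'],
    'Flight Hours': ['21,688,000', '-', '20,881,000'],
    'Accidents per 100,000 Flight Hour, All': ['6.63', '-', '7.04'],
})

def to_number(col):
    """Hypothetical helper: strip thousands separators, then turn
    anything non-numeric (such as '-') into NaN."""
    return pd.to_numeric(col.str.replace(',', '', regex=False), errors='coerce')

clean = raw.copy()
for name in clean.columns.drop('Year'):
    clean[name] = to_number(clean[name])

print(clean.dtypes)
```

Columns with a missing value end up as float64 because NaN is a float, which matches the 'Flight Hours' dtype shown above.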
In [24]:
genav_clean = genav_clean.set_index(genav_clean['Year'])

Now that we have finished cleaning the data, let's save the final product to our clean data folder.

In [25]:
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/
/Users/laurenheintz/Docs/MSDS/Fa2019/data512/FinalProject
In [26]:
genav_clean.to_csv('data_clean/genav_aggregate.csv')

Import and Process Data Source 1 ASIAS Database

In [142]:
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject/data_raw
/Users/laurenheintz/Docs/MSDS/Fa2019/data512/FinalProject/data_raw
In [28]:
# Read in accidents data for all aviation in the US

faa_aids = pd.read_csv('faaAccidentIncidentDataSystem.csv', sep=',', header=0) # source 1
/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3058: DtypeWarning: Columns (12,14) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [29]:
# Check data types

faa_aids.dtypes
Out[29]:
AIDS Report Number                   object
Local Event Date                     object
Event City                           object
Event State                          object
Event Airport                        object
Event Type                           object
Aircraft Damage                      object
Flight Phase                         object
Aircraft Make                        object
Aircraft Model                       object
Aircraft Series                      object
Operator                             object
Primary Flight Type                  object
Flight Conduct Code                  object
Flight Plan Filed Code               object
Aircraft Registration Nbr            object
Total Fatalities                      int64
Total Injuries                        int64
Aircraft Engine Make                 object
Aircraft Engine Model                object
Engine Group Code                    object
Nbr of Engines                      float64
PIC Certificate Type                 object
PIC Flight Time Total Hrs           float64
PIC Flight Time Total Make-Model    float64
dtype: object

Since we will be using the Aircraft Make, Model, and Series I will convert these to strings.

In [30]:
# Cast types as strings

faa_aids['Aircraft Make'] = faa_aids['Aircraft Make'].astype(str)
faa_aids['Aircraft Model'] = faa_aids['Aircraft Model'].astype(str)
faa_aids['Aircraft Series'] = faa_aids['Aircraft Series'].astype(str)

No more data processing is needed for this data set. The changes are not significant enough to merit saving another copy.
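
As an aside, the mixed-type warning and the string casts could probably be handled together when the file is read, by passing a dtype mapping to pd.read_csv. The sketch below assumes a tiny in-memory CSV with made-up rows (the report numbers R1-R3 are hypothetical), so it illustrates the idea rather than the real file.

```python
import io
import pandas as pd

# Made-up rows for illustration only; the real file has many more columns
sample_csv = io.StringIO(
    "AIDS Report Number,Aircraft Make,Aircraft Model,Aircraft Series,Total Fatalities\n"
    "R1,CESSNA,CE-172,N,0\n"
    "R2,BOEING,727,200,0\n"
    "R3,,,,1\n"
)

# Ask pandas to read the aircraft columns as text from the start
text_cols = ['Aircraft Make', 'Aircraft Model', 'Aircraft Series']
sample = pd.read_csv(sample_csv, dtype={c: str for c in text_cols})

print(sample)
```

With this approach a model such as 727 stays the text '727', and empty cells stay as missing values instead of becoming the string 'nan'.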

VII. Data Analysis

In [49]:
%cd ~/Docs/MSDS/Fa2019/data512/FinalProject
/Users/laurenheintz/Docs/MSDS/Fa2019/data512/FinalProject
  • 1 a. How has the number of accidents/fatalities in the United States changed over the past ~30 years?

For this, I will use data sources 2 & 3 to plot the aggregate totals either as a line or bar chart. Source 2 will be for accidents and source 3 will be for fatalities. I will make one plot for fatalities and one plot for accidents. For accidents, I will either plot multiple lines in different colors to denote the different types of accident totals, or use a stacked bar chart. For fatalities I will also use a line or bar chart.

Below, we will plot the number of annual accidents as a simple line plot.

In [107]:
# Set up the plot

fig = plt.figure(1, figsize=(18, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot multiple lines
plt.plot(accidents_all_clean['Accidents, Major'], "--", color = 'orange', label = 'Accidents, Major')
plt.plot(accidents_all_clean['Accidents, Serious'], "--", color = 'green', label = 'Accidents, Serious')
plt.plot(accidents_all_clean['Accidents, Injury'], "--", color = 'blue', label = 'Accidents, Injury')
plt.plot(accidents_all_clean['Accidents, Damage'], "--", color = 'purple', label = 'Accidents, Damage')
plt.plot(accidents_all_clean['Accidents, Fatal'], "--", color = 'red', label = 'Accidents, Fatal')
plt.plot(accidents_all_clean['Accidents, All'], "-", color = 'black', label = 'Accidents, All')

# Add titles, axes labels, and legend
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Accidents')
plt.legend(loc='upper left')

fig.savefig('results/accidents_lines.png')
plt.show()

From this chart, we surprisingly see that the overall number of accidents does not have a clear trend up or down. It is a bit more volatile. However, we notice that what is driving that is the purple and blue lines - the "damage" and "injury" accidents which are the least serious of all the accident categories. For "major", "serious", and "fatal" accidents, we actually do see a slight trend downward and an overall lower number.

Recall the definitions of the accident types from the background:

Term | Definition
Accident, Major | An accident in which any of three conditions is met: 1) a Part 121 aircraft was destroyed, or 2) there were multiple fatalities, or 3) there was one fatality and a Part 121 aircraft was substantially damaged.
Accident, Serious | An accident in which at least one of two conditions is met: 1) there was one fatality without substantial damage to a Part 121 aircraft, 2) there was at least one serious injury and a Part 121 aircraft was substantially damaged.
Accident, Injury | A nonfatal accident with at least one serious injury and without substantial damage to a Part 121 aircraft.
Accident, Damage | An accident in which no person was killed or seriously injured, but in which any aircraft was substantially damaged.

Since this line plot has a lot of overlapping sections, it could be confusing for readers to understand the real differences in trends. Plotting with a stacked bar chart will provide another lens through which to view this.

Again, this is the number of accidents, and the different types of accidents over the past 3 decades.

In [108]:
# Set color scheme for stacked bar chart
colors = ['#f53b11', '#f0701a', '#ed9134', '#f0b54f', '#f2d999']

# Plot all accident types as stacked bars, keeping the returned Axes so the labels below apply to this chart
ax = accidents_all_clean.loc[:,['Accidents, Fatal', 'Accidents, Major','Accidents, Serious', 'Accidents, Injury',
                                'Accidents, Damage']].plot.bar(stacked=True, color=colors, figsize=(17,8))

# Add title, axes, and legends
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Accidents')

plt.savefig('results/accidents_bars.png')
plt.show()
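
One detail worth illustrating for the stacked bar chart: DataFrame.plot.bar returns the matplotlib Axes it draws on, so axis labels can be set on that object directly. The sketch below is a minimal example under stated assumptions: the counts are made up for illustration and are not values from the data sources, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, an assumption for this sketch
import pandas as pd

# Made-up counts for illustration only; NOT values from the data sources
toy = pd.DataFrame(
    {'Accidents, Fatal': [2, 1, 0],
     'Accidents, Injury': [10, 12, 9],
     'Accidents, Damage': [20, 18, 25]},
    index=pd.Index([2012, 2013, 2014], name='Year'),
)

# plot.bar returns the Axes that holds the stacked bars
ax = toy.plot.bar(stacked=True, figsize=(8, 4))
ax.set_title('Aircraft Accidents Over Time (toy data)')
ax.set_xlabel('Year')
ax.set_ylabel('Number of Accidents')

# The parent figure is available if the chart needs to be saved
fig = ax.get_figure()
```

Calling fig.savefig with a file path would then write this specific chart to disk.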

Next, we will look more specifically at just fatalities instead of all accidents because, as previously discussed, accidents can range from fairly minor to very severe.

In [109]:
# Set up the plot
fig = plt.figure(1, figsize=(18, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot both total fatalities and fatalities on board
plt.plot(accidents_all_clean['Fatalities, Total'], "-", color = 'black', label = 'Fatalities, Total')
plt.plot(accidents_all_clean['Fatalities, Aboard'], "--", color = 'green', label = 'Fatalities, Aboard')

# Add title, axes labels, and legend to plot
plt.title('Fatalities Resulting from Aircraft Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
plt.xticks(accidents_all_clean['Year'])
ax.set_ylabel('Fatalities Due to Aviation')
plt.legend(loc='upper left')

fig.savefig('results/fatalities_lines.png')
plt.show()

As we can see from this plot, there are markedly fewer fatalities after about 2002. The black line also includes fatalities on the ground, but the total is rarely much higher than the number of fatalities aboard.

  • 1 b. How has the ratio of accidents/fatalities per annual miles flown in the United States changed over the past ~30 years?

For these, I will again use tables 2 and 3 for accidents and fatalities respectively. Both of these tables contain rate information. I may address this question in several plots. First, I will plot the gross miles flown per year as a line or bar chart. This will bring to light the huge spike in air travel in the past 30 years. After this, I will plot accidents and fatalities as a rate of miles flown per year. This will show that, proportional to the increase in air travel over the years, the likelihood of being in a plane that gets in an accident has gone down dramatically.

Before we dive right in to analyzing rates, it's important to understand the other half of the numbers first. Before looking at how the rates of accidents have changed over time, let's first look at how the quantity of air travel has changed over time. We will start with the number of departures and the number of miles flown.

In [110]:
# Set up the plot
fig = plt.figure(1, figsize=(18, 5))
ax = fig.add_subplot(1, 1, 1)

# Plot departure data as a bar chart
plt.bar(accidents_all_clean['Year'], accidents_all_clean['Departures']/1000000, align='center', alpha=0.5, color='#f59bfa')

# Add legends, titles, axes labels
plt.xticks(accidents_all_clean['Year'])
plt.ylabel('Departures in Millions')
plt.title('Increase in Departures in the United States over Time')
ax.set_xlabel('Year 1983 - 2014')

fig.savefig('results/departures.png')
plt.show()

There is a clear increase in the number of overall departures, with an especially steep rise in the late 1990s.

In [111]:
# Set up the plot
fig = plt.figure(1, figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)

# Plot the miles flown data as a bar chart
plt.bar(accidents_all_clean['Year'], accidents_all_clean['Miles Flown']/1000000000, align='center', alpha=0.5, color='#28d48c')

# Add legends, titles, axes labels
plt.xticks(accidents_all_clean['Year'])
plt.ylabel('Miles Flown (Billions)')
plt.title('Increase in Miles Flown (in billions) in the United States over Time')
ax.set_xlabel('Year 1983 - 2014')

fig.savefig('results/miles.png')
plt.show()

Unsurprisingly, for miles flown we see a similar trend as the total number of departures.

Now that we have these numbers, we can look at the rate of accidents in comparison to the number of miles flown in that year.

In [112]:
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot accident data as a line plot
plt.plot(1000000 * accidents_all_clean['Accidents, All']/accidents_all_clean['Miles Flown'], "-", color = 'black', label = 'Accidents Per Million Miles, All')
plt.plot(1000000 * accidents_all_clean['Accidents, Fatal']/accidents_all_clean['Miles Flown'], "--", color = 'red', label = 'Accidents Per Million Miles, Fatal')
plt.plot(1000000 * accidents_all_clean['Accidents, Major']/accidents_all_clean['Miles Flown'], "--", color = 'orange', label = 'Accidents Per Million Miles, Major')
plt.plot(1000000 * accidents_all_clean['Accidents, Serious']/accidents_all_clean['Miles Flown'], "--", color = 'green', label = 'Accidents Per Million Miles, Serious')
plt.plot(1000000 * accidents_all_clean['Accidents, Injury']/accidents_all_clean['Miles Flown'], "--", color = 'blue', label = 'Accidents Per Million Miles, Injury')
plt.plot(1000000 * accidents_all_clean['Accidents, Damage']/accidents_all_clean['Miles Flown'], "--", color = 'purple', label = 'Accidents Per Million Miles, Damage')

# Add titles, legends, axes labels
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Accidents per Million Miles Flown')
plt.legend(loc='upper right')

fig.savefig('results/accidents_miles_lines.png')
plt.show()

To bring attention to how low this overall number of accidents is, I will do a bit of math to put things in perspective. Looking specifically at fatal accidents:

In [141]:
fatal_rate_df = 1000000 * accidents_all_clean['Accidents, Fatal']/accidents_all_clean['Miles Flown']
fatal_rate_df
Out[141]:
Year
1983    0.001303
1984    0.000292
1985    0.001928
1986    0.000747
1987    0.001147
1988    0.000666
1989    0.002389
1990    0.001213
1991    0.000829
1992    0.000794
1993    0.000190
1994    0.000730
1995    0.000531
1996    0.000851
1997    0.000597
1998    0.000148
1999    0.000282
2000    0.000399
2001    0.000823
2002    0.000000
2003    0.000275
2004    0.000252
2005    0.000367
2006    0.000246
2007    0.000120
2008    0.000248
2009    0.000268
2010    0.000132
2011    0.000000
2012    0.000000
2013    0.000261
2014    0.000000
dtype: float64

Let's review the numbers quickly. Big numbers are hard to grasp.

The last non-zero fatal accident rate was 0.000261 accidents per million miles.

In other words 0.261 per billion miles.

Since 0.261 * 4 ~ 1, let's estimate that this is 1 accident per 4 billion miles.

Seattle to New York is 2422 miles. Let's say you flew this route every single day.

At that rate, you would need to fly from Seattle to New York every day for 1,651,527 days, on average, before being in a fatal accident, based on recent statistics.

This is the equivalent of flying this route every day for 4,524 years.
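
The arithmetic above can be reproduced in a few lines. This is a sketch of the same back-of-the-envelope estimate, using the 2013 rate from the output above and the route distance stated in the text. The variable names are chosen for this example, and the results are truncated to whole days and years to match the figures quoted above.

```python
# Last non-zero fatal accident rate from the output above (2013)
fatal_accidents_per_million_miles = 0.000261
route_miles = 2422  # Seattle to New York, as stated in the text

# Unrounded miles flown per fatal accident
miles_per_accident = 1_000_000 / fatal_accidents_per_million_miles

# The text rounds this to 1 accident per 4 billion miles
rounded_miles_per_accident = 4_000_000_000

days = rounded_miles_per_accident / route_miles  # one flight per day
years = days / 365

print(f"Unrounded: {miles_per_accident:,.0f} miles per fatal accident")
print(f"Rounded estimate: {int(days):,} days, or {int(years):,} years")
```

The unrounded figure is a little under 4 billion miles, so the rounded estimate slightly overstates the number of days.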

Again, in this next plot we will put the overall fatalities in perspective with the number of annual miles flown in the United States.

In [113]:
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot the number of fatalities per million miles
plt.plot(1000000 * accidents_all_clean['Fatalities, Total']/accidents_all_clean['Miles Flown'], "-", color = 'black', label = 'Fatalities Per Million Miles, Total')
plt.plot(1000000 * accidents_all_clean['Fatalities, Aboard']/accidents_all_clean['Miles Flown'], "--", color = 'green', label = 'Fatalities Per Million Miles, Aboard')
plt.xticks(accidents_all_clean['Year'])

# Add titles, axes, and legend
plt.title('Fatalities Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Fatalities Per Million Miles Flown')
plt.legend(loc='upper left')

fig.savefig('results/fatalities_miles_lines.png')
plt.show()

It appears that even with the increase in air travel, the fatality rate per mile flown has decreased over the past 30 years, which is good.

  • 1 c. How has the ratio of accidents/fatalities per annual flight hours in the United States changed over the past ~30 years?

For these, I will again use tables 2 and 3 for accidents and fatalities respectively. Both of these tables contain rate information. I may address this question in several plots. First, I will plot the gross annual flight hours as a line or bar chart. This will bring to light the huge spike in air travel in the past 30 years. After this, I will plot accidents and fatalities as a rate of annual flight hours. This will show that, proportional to the increase in air travel over the years, the likelihood of being in a plane that gets in an accident has gone down dramatically.

Again, before diving straight into analyzing rates, it's important to understand the objective numbers of how much air travel has changed over time.

In [114]:
# Set up the plot
fig = plt.figure(1, figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)

# Plot the number of flight hours
plt.bar(accidents_all_clean['Year'], accidents_all_clean['Flight Hours']/1000000, align='center', alpha=0.5)

# Add titles, axes, and legend
plt.xticks(accidents_all_clean['Year'])
plt.ylabel('Hours in Millions')
plt.title('Increase in Flight Hours in the United States over Time')
ax.set_xlabel('Year 1983 - 2014')

fig.savefig('results/hours_bars.png')
plt.show()

Annual flight hours increase in a similar way to miles flown and number of departures.

In [115]:
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot accident data per flight hours as a line plot
plt.plot(1000000*accidents_all_clean['Accidents, Major']/accidents_all_clean['Flight Hours'], "--", color = 'orange', label = 'Accidents Per Million Flight Hours, Major')
plt.plot(1000000*accidents_all_clean['Accidents, Fatal']/accidents_all_clean['Flight Hours'], "--", color = 'red', label = 'Accidents Per Million Flight Hours, Fatal')
plt.plot(1000000*accidents_all_clean['Accidents, All']/accidents_all_clean['Flight Hours'], "-", color = 'black', label = 'Accidents Per Million Flight Hours, All')

# Add titles, axes, and legend
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Accidents per Million Flight Hours')
plt.legend(loc='upper left')

fig.savefig('results/accidents_major_hours_lines.png')
plt.show()

While the total accident rate stays somewhat stable, the rates of fatal and major accidents (the most important ones) have gone down over time, even relative to the number of flight hours.

In [116]:
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot accident data per flight hours as a line plot
plt.plot(1000000*accidents_all_clean['Accidents, Serious']/accidents_all_clean['Flight Hours'], "--", color = 'green', label = 'Accident Rate, Serious')
plt.plot(1000000*accidents_all_clean['Accidents, Injury']/accidents_all_clean['Flight Hours'], "--", color = 'blue', label = 'Accident Rate, Injury')
plt.plot(1000000*accidents_all_clean['Accidents, Damage']/accidents_all_clean['Flight Hours'], "--", color = 'purple', label = 'Accident Rate, Damage')

# Add titles, axes, and legend
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Accidents per Million Flight Hours')
plt.legend(loc='upper left')

fig.savefig('results/accidents_minor_hours_lines.png')
plt.show()

Major

An accident in which any of three conditions is met: 1) a Part 121 aircraft was destroyed, or 2) there were multiple fatalities, or 3) there was one fatality and a Part 121 aircraft was substantially damaged.

Serious

An accident in which at least one of two conditions is met: 1) there was one fatality without substantial damage to a Part 121 aircraft, 2) there was at least one serious injury and a Part 121 aircraft was substantially damaged.

Injury

A nonfatal accident with at least one serious injury and without substantial damage to a Part 121 aircraft.

Damage

An accident in which no person was killed or seriously injured, but in which any aircraft was substantially damaged.

In [117]:
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot fatalities as a line plot
plt.plot(1000000*accidents_all_clean['Fatalities, Total']/accidents_all_clean['Flight Hours'], "-", color = 'black', label = 'Fatality Rate, Total')
plt.plot(1000000*accidents_all_clean['Fatalities, Aboard']/accidents_all_clean['Flight Hours'], "--", color = 'green', label = 'Fatality Rate, Aboard')
plt.xticks(accidents_all_clean['Year'])

# Add titles, axes, and legend
plt.title('Fatalities Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Fatalities Per Million Flight Hours')
plt.legend(loc='upper left')

fig.savefig('results/fatalities_hours_lines.png')
plt.show()
In [147]:
fatality_rate_df = 1000000*accidents_all_clean['Fatalities, Aboard']/accidents_all_clean['Flight Hours']
fatality_rate_df
Out[147]:
Year
1983     1.918124
1984     0.489888
1985    60.276279
1986     0.701677
1987    21.605998
1988    24.594840
1989    24.479928
1990     0.987645
1991     4.159377
1992     2.508148
1993     0.000000
1994    18.058085
1995    11.995329
1996    25.461745
1997     0.378833
1998     0.000000
1999     0.626595
2000     5.027527
2001    29.470886
2002     0.000000
2003     1.202219
2004     0.741427
2005     1.031458
2006     2.543709
2007     0.050923
2008     0.052283
2009     2.893316
2010     0.112670
2011     0.000000
2012     0.000000
2013     0.508683
2014     0.000000
dtype: float64

The last non-zero fatality rate was 0.508683 fatalities per million flight hours.

This is 1 fatality per 1,965,860 flight hours.

Flying from Seattle to New York is a 4.5 hour flight.

If you were to take this flight every single day, it would take, on average, 436,857 days, or 1,196 years of flying this route every single day, before it was likely that you would have been in a fatal aircraft accident.
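
The same estimate can be checked in code. The sketch below reproduces the figures above from the 2013 rate. It also computes a simple average of the ten annual rates from 2005 to 2014 listed in the output, since a single year can be noisy. Averaging annual rates this way is an assumption of this sketch, not a method used elsewhere in this analysis, and the variable names are chosen for illustration.

```python
# Last non-zero fatality rate from the output above (2013)
fatalities_per_million_hours = 0.508683
flight_hours = 4.5  # Seattle to New York, as stated in the text

hours_per_fatality = 1_000_000 / fatalities_per_million_hours
days = hours_per_fatality / flight_hours  # one flight per day
years = days / 365

print(f"{int(hours_per_fatality):,} flight hours per fatality")
print(f"{int(days):,} days, or {int(years):,} years")

# Annual rates for 2005-2014, copied from the output above
recent_rates = [1.031458, 2.543709, 0.050923, 0.052283, 2.893316,
                0.112670, 0.0, 0.0, 0.508683, 0.0]

# Assumption: an unweighted mean of annual rates as a smoother summary
mean_recent_rate = sum(recent_rates) / len(recent_rates)
print(f"Mean annual rate 2005-2014: {mean_recent_rate:.4f} per million flight hours")
```

The ten-year average is somewhat higher than the 2013 rate alone, so an estimate based on it would give a shorter, though still very long, time span.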

  • 1 d. How do the rates of accidents/fatalities vary between commercial* and all public and private flights?

After addressing parts 1a-1c for just commercial* flights, I turn to table 4, which contains the same accident, fatality, and rate data but for general aviation in the United States. I will use this to compare the relative safety of commercial flight vs general flight (including private and personal flights). I will choose just one metric, accidents per flight hour, and plot both the commercial and general aviation lines on the same plot in different colors.

In [328]:
genav_clean.head(5)
Out[328]:
Year Accidents, All Accidents, Fatal Fatalities, Total Fatalities, Aboard Flight Hours Accidents per 100,000 Flight Hour, All Accidents per 100,000 Flight Hour, Fatal
0 1975 3995 633 1252 1231 28799000 13.87 2.19
1 1976 4018 658 1216 1203 30476000 13.17 2.16
2 1977 4079 661 1276 1265 31578000 12.91 2.09
3 1978 4216 719 1556 1398 34887000 12.08 2.06
4 1979 3818 631 1221 1203 38641000 9.88 1.63

Based on the data that I have for general aviation, there are four things I will compare to commercial aviation:

  • Quantity of Accidents
  • Quantity of Fatalities
  • Rate of Accidents per Million Flight Hours
  • Rate of Fatalities per Million Flight Hours
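
Note that the general aviation table reports rates per 100,000 flight hours, while the comparisons listed above use rates per million flight hours. The two differ by a factor of 10. The sketch below uses the 2012-2014 rows copied from the general aviation table to show the conversion. The recomputed values land close to, but not always exactly on, the published rate column (7.04, 6.26, 6.74), possibly because the published flight hours are rounded.

```python
import pandas as pd

# 2012-2014 rows copied from the general aviation table
genav_sample = pd.DataFrame(
    {'Accidents, All': [1470, 1224, 1221],
     'Flight Hours': [20881000, 19492000, 18103000]},
    index=pd.Index([2012, 2013, 2014], name='Year'),
)

# Rate per million flight hours, the unit used in the plots
per_million = 1_000_000 * genav_sample['Accidents, All'] / genav_sample['Flight Hours']

# Rate per 100,000 flight hours, the unit used in the source table
per_100k = per_million / 10

print(pd.DataFrame({'per 100,000 hours': per_100k.round(2),
                    'per million hours': per_million.round(2)}))
```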
In [118]:
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot accident data as a line plot
plt.plot(accidents_all_clean['Accidents, All'], "-", color = 'blue', label = 'Commercial Accidents, All')
plt.plot(accidents_all_clean['Accidents, Fatal'], "--", color = 'blue', label = 'Commercial Accidents, Fatal')
plt.plot(genav_clean['Accidents, All'], "-", color = 'orange', label = 'General Aviation Accidents, All')
plt.plot(genav_clean['Accidents, Fatal'], "--", color = 'orange', label = 'General Aviation Accidents, Fatal')

# Add titles, axes, and legend
plt.title('Aircraft Accidents Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Accidents')
plt.legend(loc='upper right')

fig.savefig('results/public_private_accidents.png')
plt.show()

We see in the above graph the overall number of accidents for both private and commercial flight. The number of commercial accidents appears to stay somewhat constant, which we saw in previous plots as well. However, the number of accidents in general aviation has decreased dramatically. We will explore some of the potential reasons for this later on in the analysis, but I suspect that a decrease in the volume of private flight accounts for most of this drop in private flight accidents.

In [119]:
# Set up plot
fig = plt.figure(1, figsize=(14, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot number of fatalities for both commercial and general aviation
plt.plot(accidents_all_clean['Fatalities, Total'], "-", color = 'blue', label = 'Commercial Fatalities, Total')
plt.plot(accidents_all_clean['Fatalities, Aboard'], "--", color = 'blue', label = 'Commercial Fatalities, Aboard')
plt.plot(genav_clean['Fatalities, Total'], "-", color = 'orange', label = 'General Aviation Fatalities, Total')
plt.plot(genav_clean['Fatalities, Aboard'], "--", color = 'orange', label = 'General Aviation Fatalities, Aboard')

# Add titles, axes, and legend
plt.title('Aircraft Fatalities Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Number of Fatalities')
plt.legend(loc='upper right')

fig.savefig('results/public_private_fatalities.png')
plt.show()

The above plot shows a downward trend of aircraft fatalities over time. This is a good thing and somewhat expected as technology improves. The private sector does appear overall to have a much higher number of fatalities.

In [120]:
# Set years to be labels
labels = accidents_all_clean['Year'].values

# Make years to be the x axis
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

# Create bar plot
fig, ax = plt.subplots(figsize=(16, 8))
rects1 = ax.bar(x - width/2, accidents_all_clean['Flight Hours']/1000000, width, label='Commercial Flight Hours')
rects2 = ax.bar(x + width/2, genav_clean['Flight Hours'].loc[1983:2015]/1000000, width, label='General Aviation Flight Hours')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Flight Hours in Millions')
ax.set_xlabel('Years 1983 - 2014')
ax.set_title('Change in Annual Flight Hours for Commercial and General Flight')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

fig.savefig('results/public_private_hours.png')
plt.show()

Interestingly, we see that the number of flight hours for commercial flight has increased significantly over the past three decades. In comparison, flight hours for general aviation, or private flight, have actually dropped over the decades. This is an interesting finding, because I would have thought they both would have increased over time. This could be attributed to many factors, such as a drop in the overall number of pilots, increased cost of fuel, or an increase of commercial flights into smaller airports, thereby making them too busy for private flight.

In [121]:
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot accident data as a line plot respective to total number of flight hours
plt.plot(accidents_all_clean['Accidents, All']/(accidents_all_clean['Flight Hours']/1000000),
         "-", color = 'blue', label = 'Commercial Accidents per Million Hours, All')
plt.plot(accidents_all_clean['Accidents, Fatal']/(accidents_all_clean['Flight Hours']/1000000),
         "--", color = 'blue', label = 'Commercial Accidents per Million Hours, Fatal')
plt.plot(genav_clean['Accidents, All']/(genav_clean['Flight Hours'].loc[1983:2015]/1000000), 
         "-", color = 'orange', label = 'General Aviation Accidents per Million Hours, All')
plt.plot(genav_clean['Accidents, Fatal']/(genav_clean['Flight Hours'].loc[1983:2015]/1000000), 
         "--", color = 'orange', label = 'General Aviation Accidents per Million Hours, Fatal')

# Add some text for labels, title and custom x-axis tick labels, etc.
plt.title('Aircraft Accidents Per Million Flight Hours in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Accidents Per Million Flight Hours')
plt.legend(loc='upper right')

fig.savefig('results/public_private_accidents_hours.png')
plt.show()
In [122]:
# Set up the plot
fig = plt.figure(1, figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)

# Plot fatality rates data as a line plot
plt.plot(accidents_all_clean['Fatalities, Total']/(accidents_all_clean['Flight Hours']/1000000),
         "-", color = 'blue', label = 'Commercial Fatalities, Total')
plt.plot(accidents_all_clean['Fatalities, Aboard']/(accidents_all_clean['Flight Hours']/1000000),
         "--", color = 'blue', label = 'Commercial Fatalities, Aboard')
plt.plot(genav_clean['Fatalities, Total']/(genav_clean['Flight Hours'].loc[1983:2015]/1000000),
         "-", color = 'orange', label = 'General Aviation Fatalities, Total')
plt.plot(genav_clean['Fatalities, Aboard']/(genav_clean['Flight Hours'].loc[1983:2015]/1000000),
         "--", color = 'orange', label = 'General Aviation Fatalities, Aboard')

# Add some text for labels, title and custom x-axis tick labels, etc.
plt.title('Aircraft Fatalities Per Million Hours Over Time in the United States')
ax.set_xlabel('Year 1983 - 2014')
ax.set_ylabel('Fatalities Per Million Flight Hours')
plt.legend(loc='upper right')

fig.savefig('results/public_private_fatalities_hours.png')
plt.show()

After we normalize for the number of flight hours, we plot the chart again and see a slightly different story. We see that on average, private flight is still more dangerous. However, in years where there is a significant spike in commercial fatalities, the commercial rate is around the same as the private flight fatality rate for that year.

This makes sense, because when there is a commercial accident, even if it is just one plane, it is bound to be a much larger plane. Therefore, a single commercial crash could cause as many fatalities as possibly 30 private crashes.

We can infer from this that commercial travel is safer primarily because there are fewer accidents. However, if there is a commercial accident in a given year, it dramatically changes the picture, since one commercial aircraft can carry so many people.

  1. What kinds of planes are responsible for the most accidents?

Finally, in looking at source 1, my most detailed data source, I can see more granular information for each accident. This contains the type of airplane for every data entry. To present to the reader what planes are responsible for the most crashes, I will create a pivot table that counts the number of accidents for each plane. Then, I will create a table for the reader that displays just the top 5 planes. Since I do not expect the reader to have a knowledge of all types of planes, I will include either an image or a brief description of each plane.

This methodology was further refined into two sub-questions. The first looks at the most common plane amongst all incidents, and the second looks at the most common plane in fatal accidents.

In [31]:
faa_aids.columns
Out[31]:
Index(['AIDS Report Number', 'Local Event Date', 'Event City', 'Event State',
       'Event Airport', 'Event Type', 'Aircraft Damage', 'Flight Phase',
       'Aircraft Make', 'Aircraft Model', 'Aircraft Series', 'Operator',
       'Primary Flight Type', 'Flight Conduct Code', 'Flight Plan Filed Code',
       'Aircraft Registration Nbr', 'Total Fatalities', 'Total Injuries',
       'Aircraft Engine Make', 'Aircraft Engine Model', 'Engine Group Code',
       'Nbr of Engines', 'PIC Certificate Type', 'PIC Flight Time Total Hrs',
       'PIC Flight Time Total Make-Model'],
      dtype='object')

For the first analysis we look at all incidents. Most of these are not fatal; a row simply means that an incident of some sort was recorded in the database for that plane.

In [88]:
# Next, we essentially create a pivot table of the make/model of the airplanes
plane_type_subset = faa_aids[['AIDS Report Number', 'Aircraft Make','Aircraft Model']].copy()
plane_type_subset['Aircraft Make Model'] = plane_type_subset['Aircraft Make'] + ' ' + plane_type_subset['Aircraft Model']

# Drop rows with no make/model data
plane_type_subset = plane_type_subset.replace('nan', np.nan)
plane_type_subset_clean  = plane_type_subset.dropna()

# Create a count table
type_count = plane_type_subset_clean.groupby(['Aircraft Make Model']).count()

# Drop unused columns
type_count = type_count.drop(columns=["Aircraft Make","Aircraft Model"])

# Rename one column
type_count = type_count.rename(columns={"AIDS Report Number": "Number of Incident Reports"})
In [94]:
# Sort and display by top count
df = type_count.sort_values(by =['Number of Incident Reports'], ascending = False).head(5)
df
Out[94]:
Number of Incident Reports
Aircraft Make Model
CESSNA CE-172 5382
PIPER PA-28 4282
BOEING 727 3442
CESSNA CE-210 3308
MOONEY M-20 3142
In [123]:
df.to_csv('results/incidents_by_make_model.csv')
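
The count table above could also be sketched more compactly with value_counts, which counts and sorts in one step. This is a hedged alternative rather than the method used here. The six report rows below are made up for illustration (the report numbers A1-A6 are hypothetical) and only mimic the column layout of the ASIAS data.

```python
import pandas as pd

# Made-up report rows for illustration only
reports = pd.DataFrame({
    'AIDS Report Number': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
    'Aircraft Make': ['CESSNA', 'CESSNA', 'PIPER', 'CESSNA', None, 'PIPER'],
    'Aircraft Model': ['CE-172', 'CE-172', 'PA-28', 'CE-210', 'PA-28', None],
})

# Keep only rows where both make and model are known
complete = reports.dropna(subset=['Aircraft Make', 'Aircraft Model'])

# Build the combined label, then count and sort in one call
make_model = complete['Aircraft Make'] + ' ' + complete['Aircraft Model']
top = make_model.value_counts().head(5).rename('Number of Incident Reports')

print(top)
```

Because the rows with a missing make or model are dropped before the labels are built, no placeholder strings such as 'nan' can end up in the counts.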
In [137]:
## Cessna 172
Image(filename='results/Cessna172S.jpg') 
Out[137]:
In [138]:
## Piper PA-28
Image(filename='results/PA-28.jpg') 
Out[138]:
In [135]:
## Boeing 727
Image(filename='results/B-727.jpg') 
Out[135]:
In [139]:
## Cessna 210
Image(filename='results/Cessna210.jpg') 
Out[139]:
In [140]:
## Mooney M-20
Image(filename='results/M-20.jpg') 
Out[140]:

We see a variety of planes among those with the highest number of incidents. Incidents and accidents can sometimes be minor, which is probably why there doesn't seem to be a common thread between these planes. The list includes single-engine and multi-engine planes, both small and large.

For the next analysis we will look at the aircraft with the most fatal accidents.

In [100]:
# Next, we essentially create a pivot table of the make/model of the airplanes
fatal_plane_type_subset = faa_aids[['AIDS Report Number', 'Aircraft Make','Aircraft Model', 
                                    'Total Fatalities']].copy()
fatal_plane_type_subset['Aircraft Make Model'] = fatal_plane_type_subset['Aircraft Make'] + ' ' + fatal_plane_type_subset['Aircraft Model']

# Drop rows with no make/model data
fatal_plane_type_subset = fatal_plane_type_subset.replace('nan', np.nan)
fatal_plane_type_subset_clean  = fatal_plane_type_subset.dropna()

# Drop items with zero fatalities
fatal_plane_type_subset_clean = fatal_plane_type_subset_clean[fatal_plane_type_subset_clean['Total Fatalities'] > 0]

# Group by make/model and count reports (sorting happens in the next cell)
fatal_type_count = fatal_plane_type_subset_clean.groupby(['Aircraft Make Model']).count()

# Rename columns
fatal_type_count = fatal_type_count.rename(columns={"AIDS Report Number": "Number of Incident Reports"})

# Drop unused columns
fatal_type_count = fatal_type_count.drop(columns=["Aircraft Make","Aircraft Model","Total Fatalities"])
In [101]:
# Sort and display top 5 by largest count
df = fatal_type_count.sort_values(by =['Number of Incident Reports'], ascending = False).head(5)
df
Out[101]:
Number of Incident Reports
Aircraft Make Model
CESSNA CE-182 216
DE HAVILLAND-BOMBARDIER DHC-6 TWIN OTTER 55
CESSNA CE-180 40
CESSNA CE-206 39
DOUG DC-3 38
In [124]:
df.to_csv('results/fatalities_by_make_model.csv')

Furthermore, the two planes which are not Cessnas happen to be twin-engine airplanes. This is also not surprising to me, because twin-engine planes are much more difficult to fly than single-engine planes. A pilot who is rusty could easily be overwhelmed in a twin-engine plane. I also know this anecdotally because my uncle is a pilot. He was a helicopter pilot in the Air Force during the Vietnam War. He has maintained his pilot's license and owns multiple planes; however, he sold his twin-engine plane a couple of years ago because all of the crash reports he read involved older men, often ex-military, crashing twin-engine planes on which they simply did not have enough recent experience. In the interest of safety, my uncle has downgraded to a single-engine Cessna. Hopefully he stays safe!

In [130]:
## Cessna 182
Image(filename='results/Cessna182.jpg') 
Out[130]:
In [131]:
## DHC
Image(filename='results/DHC-6.jpg') 
Out[131]:
In [132]:
## Cessna 180
Image(filename='results/Cessna180.jpg') 
Out[132]:
In [133]:
## Cessna 206
Image(filename='results/Cessna206.jpg') 
Out[133]:
In [134]:
## DC-3
Image(filename='results/DC-3.jpg') 
Out[134]:
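
The two make/model count tables above follow the same steps (subset, drop missing make/model, optionally keep only fatal reports, count). As a sketch only, and not the code that produced the results above, those steps could be wrapped in one helper. The function name `count_reports_by_make_model` and the toy data frame are hypothetical; only the column names come from the FAA AIDS table.

```python
import numpy as np
import pandas as pd


def count_reports_by_make_model(reports, fatal_only=False, top_n=5):
    """Count incident reports per make/model (hypothetical helper)."""
    cols = ['AIDS Report Number', 'Aircraft Make', 'Aircraft Model', 'Total Fatalities']
    subset = reports[cols].replace('nan', np.nan)
    # Only make and model are required; a missing fatality count is treated as zero
    subset = subset.dropna(subset=['Aircraft Make', 'Aircraft Model'])
    if fatal_only:
        subset = subset[subset['Total Fatalities'].fillna(0) > 0]
    make_model = subset['Aircraft Make'] + ' ' + subset['Aircraft Model']
    # value_counts returns the counts sorted from largest to smallest
    counts = make_model.value_counts().head(top_n)
    return counts.rename_axis('Aircraft Make Model').to_frame('Number of Incident Reports')


# Toy data for illustration only (not FAA records)
toy = pd.DataFrame({
    'AIDS Report Number': ['r1', 'r2', 'r3', 'r4'],
    'Aircraft Make': ['CESSNA', 'CESSNA', 'PIPER', 'nan'],
    'Aircraft Model': ['CE-172', 'CE-172', 'PA-28', 'PA-28'],
    'Total Fatalities': [0, 2, 0, 1],
})
all_counts = count_reports_by_make_model(toy)
fatal_counts = count_reports_by_make_model(toy, fatal_only=True)
print(all_counts)
print(fatal_counts)
```

With the real data, `toy` would be replaced by `faa_aids`.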
  1. How does the experience level of the pilots impact airplane crashes?

A question a reader may ask when boarding a plane is: how experienced is my pilot? Data source 1, the detailed table, also contains a column with the number of hours of experience of the pilot in command (PIC) for most accidents. I will extract this column and create a histogram so that the reader can see the distribution of experience hours among pilots who have been in crashes, and judge for themselves whether the number of hours could be a factor in safety.

First, before diving straight into the number of hours of experience, I want to do a bit of a qualitative analysis on what kinds of pilots get in the most accidents. I have the type of certification of the pilots in this dataset, so I should be able to see who may be more prone to accidents.

In [104]:
# Next, we subset the data to just the columns of interest
PIC_subset = faa_aids[['AIDS Report Number', 'Aircraft Damage', 'Total Fatalities', 'PIC Certificate Type', 
                       'PIC Flight Time Total Hrs', 'PIC Flight Time Total Make-Model', 
                      'Aircraft Make', 'Aircraft Model']].copy()
PIC_subset['Aircraft Make Model'] = PIC_subset['Aircraft Make'] + ' ' + PIC_subset['Aircraft Model']

# Drop rows which have NA for any of the 3 PIC columns of interest
PIC_subset = PIC_subset.dropna(subset=['PIC Certificate Type', 'PIC Flight Time Total Hrs', 'PIC Flight Time Total Make-Model'])

# Group by PIC certificate type, Sort by largest count
PIC_cert_count = PIC_subset.groupby(['PIC Certificate Type']).count()

# Drop unused columns
PIC_cert_count = PIC_cert_count.drop(columns=["Aircraft Make","Aircraft Model","Total Fatalities","Aircraft Damage","PIC Flight Time Total Hrs", "PIC Flight Time Total Make-Model", "Aircraft Make Model"])

# Rename column
PIC_cert_count = PIC_cert_count.rename(columns={"AIDS Report Number": "Number of Incident Reports"})

# Sort and display by largest count
df = PIC_cert_count.sort_values(by =['Number of Incident Reports'], ascending = False).head(10)
df
Out[104]:
Number of Incident Reports
PIC Certificate Type
PRIVATE PILOT 27551
COMMERCIAL PILOT 17037
AIRLINE TRANSPORT 14974
COMMERCIAL PILOT FLIGHT INSTRUCTOR 6974
STUDENT 5426
AIRLINE TRANSPORT PILOT FLIGHT INSTRUCTOR 3768
UNKNOWN/FOREIGN 687
PRIVATE PILOT FLIGHT INSTRUCTOR 159
PILOT NOT CERTIFICATED 102
RECREATIONAL PILOT 21
In [125]:
df.to_csv('results/incidents_by_PIC_type.csv')

As we see from the above, private pilots account for the most incident reports by far. Commercial and airline transport pilots follow, likely because they fly most frequently, even if not for a scheduled service. Note that these are raw counts of reports, not rates per pilot or per flight hour.
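
To make the comparison between certificate types easier to read, the counts in the table above can be restated as shares of the table total. This is only a sketch using the ten counts shown in Out[104]; it still says nothing about risk per pilot or per flight hour, since that would need exposure data that is not in this table.

```python
import pandas as pd

# Counts copied from the certificate-type table above (Out[104])
reports = pd.Series({
    'PRIVATE PILOT': 27551,
    'COMMERCIAL PILOT': 17037,
    'AIRLINE TRANSPORT': 14974,
    'COMMERCIAL PILOT FLIGHT INSTRUCTOR': 6974,
    'STUDENT': 5426,
    'AIRLINE TRANSPORT PILOT FLIGHT INSTRUCTOR': 3768,
    'UNKNOWN/FOREIGN': 687,
    'PRIVATE PILOT FLIGHT INSTRUCTOR': 159,
    'PILOT NOT CERTIFICATED': 102,
    'RECREATIONAL PILOT': 21,
}, name='Number of Incident Reports')

# Share of the reports in this top-10 table (not an accident rate)
share = (reports / reports.sum()).round(3)
print(share)
```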

Next we move into the quantitative analysis, looking at the pilots' actual number of hours of experience.

In [90]:
# Look at a histogram of the distributions of pilots experience
# Experience here is measured by total number of flight hours

# Set up the plot
fig = plt.figure(1, figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1)

# Create histogram with data
plt.hist(PIC_subset['PIC Flight Time Total Hrs'], 100)

# Legends, titles, axes labels
plt.title('Flight Time Experience of Pilots in Accidents')
ax.set_ylabel('Number of Incidents')
ax.set_xlabel('Hours of Experience')

plt.show()
In [91]:
# Look at a histogram of the distributions of pilots experience
# Experience here is measured by total number of flight hours in make/model

# Set up the plot
fig = plt.figure(1, figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1)

# Legends, titles, axes labels
plt.title('Flight Time Experience on Make/Model of Pilots in Accidents')
ax.set_ylabel('Number of Incidents')
ax.set_xlabel('Hours of Experience on Model')

# Create histogram with data
plt.hist(PIC_subset['PIC Flight Time Total Make-Model'], 100)

plt.show()
In [126]:
# Look at a histogram of the distributions of pilots experience
# Experience here is measured by total number of flight hours

# Set up the plot
fig, (ax1, ax2) = plt.subplots(2, figsize = (10,8))
fig.suptitle('Experience Level of Pilots in Accidents')

ax1.set_xlabel('Hours of Experience')
ax1.set_ylabel('Number of Accidents Reported')

ax2.set_xlabel('Hours of Experience on Make/Model')
ax2.set_ylabel('Number of Accidents Reported')

# Create histogram with data
ax1.hist(PIC_subset['PIC Flight Time Total Hrs'], 100)
ax2.hist(PIC_subset['PIC Flight Time Total Make-Model'], 100)

fig.savefig('results/PIC_experience_hours.png')
plt.show()

To make this final analysis a little bit easier, we combine the final two charts, Hours of Experience and Hours of Experience on Make/Model, into one figure so that they can easily be compared. We see that in both cases the histogram is strongly right skewed: most incidents fall in the lowest bins, with a long tail toward high hour counts. This is even more prominent for hours of experience on that specific make and model. There are almost double the number of incidents in the first bin of the second chart as in the first bin of the first chart.
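
The shape of these distributions can also be checked numerically instead of by eye. The sketch below uses synthetic hours drawn from a lognormal distribution, which is an assumption made purely for illustration and not FAA data. It computes the sample skewness (positive means a long tail to the right), compares the mean with the median, and measures how much of the data lands in the first of 100 equal-width bins, as in the histograms above.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical flight hours: many low-time pilots, a few very high-time ones
rng = np.random.default_rng(0)
hours = pd.Series(rng.lognormal(mean=6.5, sigma=1.2, size=5000),
                  name='PIC Flight Time Total Hrs')

skewness = hours.skew()          # > 0 indicates a right-skewed distribution
mean_hours = hours.mean()
median_hours = hours.median()    # for right-skewed data the mean exceeds the median

# Share of observations in the first of 100 equal-width bins
counts, edges = np.histogram(hours, bins=100)
first_bin_share = counts[0] / counts.sum()

print(f"skewness={skewness:.2f}, mean={mean_hours:.0f}, median={median_hours:.0f}")
print(f"share in first bin={first_bin_share:.2%}")
```

With the real data, `hours` would be replaced by `PIC_subset['PIC Flight Time Total Hrs']` or `PIC_subset['PIC Flight Time Total Make-Model']`.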

VIII. Conclusion

Findings and Implications

To summarize some of the main findings of this analysis, the total number of accidents and fatalities has not changed much over time. However, the quantity of air travel has increased a lot over the same period. So, relative to the growth in air travel, flying has become safer.

When comparing public flight and private flight, private flight clearly appears to be more dangerous. This holds not just for the grand totals, but also relative to the number of flight hours. The next natural question is - why would this be? I have a couple of hypotheses. Some of the following research questions help to answer this.

We saw that most of the fatal crashes involved quite small planes (Cessnas), with a few exceptions for some more complicated twin-engine airplanes. We also saw the importance of the pilot's experience as a contributing factor to safety. The pilot's experience on that specific make/model of airplane turned out to be even more important than the pilot's experience in general.

One conclusion I will draw is that larger planes (which are used for commercial flights) end up being safer. This is because the larger the plane is, the more flight hours the FAA requires for a pilot to be certified. The more flight hours pilots have, the more experienced they are. As we saw, the more experienced the pilot, the less likely you are to be in a fatal crash, or any crash at all. This is where regulation comes into play - the FAA has different requirements to be licensed to fly each different type of aircraft. A pilot's license on one aircraft is not automatically transferable to another. We can see this is for a good reason.

Another conclusion I want to explore is why sheer numbers make it improbable that a commercial airliner would get into an accident. The larger a plane is, the more expensive it is. An expensive plane is unlikely to be privately owned, and fewer of them probably exist in general because few companies can afford them. Since there are fewer large planes, there are fewer chances for them to crash. This is why the visuals all brought attention to the fact that private flight appeared a lot more dangerous than commercial flight.

However, another point to note is that on average, commercial flight was safer (using fatalities per million miles as the metric of safety), but for years where there was a commercial accident, the death tolls were around the same. This is because there could be 30 private flight accidents, but they likely won't have more than 4 people on board each. Just one commercial airline accident could significantly change the outlook of these numbers. This is why the safety of commercial flight is so much more consequential and so much more widely publicized than accidents in the private sector. There are simply more lives at risk in one large passenger plane.

All in all, there is one final connection I want to make. Some of the reasons I gave for the relative safety of commercial flight over private flight were:

  • more hours of experience required for certification to fly larger planes
  • larger planes are more expensive and fewer exist in general

I'd like to draw a parallel to another mode of transportation: cars. It's not a secret that driving is a lot more dangerous than flying. Roughly the same number of people die every month in the United States from automobile accidents as were killed in the 9/11 terrorist attacks. I think what we learned in this analysis is applicable to car safety as well. I would suspect that some of the key reasons driving is less safe are that cars are very cheap in comparison, that nearly everyone has one, and that there are few requirements to be certified and given a driver's license. I would be interested to see the same plot with drivers' hours of experience behind the wheel. What has helped aviation in this quest for safety has been tighter regulation around requirements for certification. Perhaps something to consider for future discussion is the requirements for getting a driver's license.

Limitations

One limitation of this data set is that it only covers 1983-2014. 2014 is not very recent, but unfortunately the FAA releases its cleaned data in 5-year increments, from what I read. Nonetheless, newer data would be useful to check that the trends still hold for the past 5 years. Obviously it would have been more interesting and useful to readers if this analysis included this past year's data. Another limitation is that a lot of the data sources are in a rolled-up aggregate format. This is convenient but can sometimes lead to some ambiguity as to what all the different categories are. Sorting out the distinctions and ensuring there is no double counting can require digging into a lot of aviation jargon.

Reflection on Human Centered Approach to Data Science

Throughout this process - from choosing a topic, to choosing data, to creating a code repo - I utilized many of the principles of human centered data science I learned in this class. The topic naturally came to me, since I work at Boeing. I get asked probably once a week, "What's it like working at Boeing right now?" The truth is - our day-to-day work as engineers not on the 737Max program is the same. This is a sign of good upper management if you ask me - that's not my job to deal with! But the reality is, the 737Max issue has gotten political. As soon as something gets political, the journalism quality inherently goes down and fact finding becomes difficult. I have seen this happen with the Max and it is frustrating to watch. I think many others share my frustration with the media taking hold of any hot topic it can and beating it to death. I truly think an objective, data-driven analysis on the facts of airplane safety would be useful for the general public right now. I know this is something I would like to see as an outsider. Ultimately, you want to arm the general public with honest, truthful information so we can start to overcome the wave of #fakenews from 2016-2018.

I thought carefully about the ethics of this before I posed my questions. Is there a chance this could slander or degrade Boeing? Could it slander another company? If it is just the numbers, is it really even slander? I ultimately determined that as long as I just focused on the United States (the only demographic for which I had good data), I could maintain the integrity of the analysis. Since the FAA releases all of this data, there is nothing that I could find that the FAA probably hasn't unearthed. Not to mention, anyone else could find (and might already have found) this data and post their findings all over the Huffington Post for the world to see. I had some confidence that the results I found would not be shocking findings for the aerospace industry, but I tried not to let this influence my analysis.

That was of course another factor - did I think that I could be unbiased, as an employee of The Boeing Company? To eliminate biases, I tried to be thoughtful about my own research questions. I did not look at differences between Boeing and Airbus, and I did not focus on any specific plane. Instead, I tried to keep my investigation to comparing the differences between private and commercial flight at large.

Because of the nature of this data, and the danger of "predictions" when it comes to people's safety, I decided not to pursue any sort of predictive models for my analysis or research questions. Unless you are trying to predict a crash, there really isn't too much you can classify and predict in this field. Even in the case of predicting an accident, any model created would have a very high accuracy only because accidents are so infrequent. When an accident actually did occur, the model would be unlikely to have predicted it, so false negatives would be high. For this reason, I just took a descriptive data visualization approach. I'd like for readers to be able to make connections for themselves, with the proper visualizations, but I don't want to be prescriptive about the certainty of any individual factor, as there can be so many confounding factors, such as age of the plane, maintenance, and crew, which cannot be accounted for in this data.
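
The point about accuracy on rare events can be shown with a little arithmetic. The numbers below are hypothetical and chosen only for illustration: a "model" that always predicts "no accident" scores almost perfect accuracy while catching none of the accidents.

```python
# Hypothetical numbers for illustration only: 1,000,000 flights, 10 of which end in an accident
n_flights = 1_000_000
n_accidents = 10

# A "model" that always predicts "no accident"
true_positives = 0
false_negatives = n_accidents
true_negatives = n_flights - n_accidents

# Accuracy: share of all flights classified correctly
accuracy = (true_positives + true_negatives) / n_flights
# Recall: share of actual accidents that the model caught
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.5f}")  # 0.99999
print(f"recall   = {recall:.1f}")    # 0.0
```

Every accident in this example is a false negative, which is why accuracy alone is a misleading metric for events this infrequent.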

For the creation of my code repository, I followed the best practices I learned from this class on data science reproducibility. I have included all my code, my source data, my cleaned data, cited my sources, provided licenses where appropriate, and noted limitations and disclaimers where appropriate. My final repo is well organized with a couple folders, and a README explaining each folder and file.

Interpretability is another key concept covered in human-centered data science. In this course, the focus was more on algorithmic interpretability. Since that was not necessarily relevant for my analysis, I instead focused on visualization interpretability. Thankfully, last quarter I took the data visualization course, where we often focused on the integrity of visualizations. Does the visualization accurately communicate the data to the user, or does it distort the data to achieve a certain end? Certainly, ethics is a question to ask when creating visualizations as well. What you choose to display, what dimensions you use, and what data you use can all distort the results you are presenting to the reader. With the intention of human centered integrity, I tried to remember the tenets of this course as well and choose appropriate and diverse visualizations to present to the readers, accompanied by an analysis. All of the final visualizations are available in a folder in the repo called "results". It contains all the plots saved as pngs.

Overall, I really enjoyed being able to choose my own topic for this analysis. I could really pair it with my passions. Ultimately, I am proud of what I put together - I believe it is thorough, detailed, reproducible, well documented, and truthful. I am excited to show this repo to folks that I work with!

IX. Notes for Reuse

For these data sets, accidents which were caused by an illegal act such as suicide, sabotage, or terrorism were NOT included in rate calculations. However, they are factored into the overall accident and fatality numbers. For example, the 9/11/2001 terrorism fatalities and accidents are included in the totals but do not affect the rate calculations for accidents per million miles flown or fatalities per million miles flown. Additionally, only those killed on board the planes in the 9/11 terrorist attack are included in the fatality numbers. Unless otherwise stated, all fatalities are on-board fatalities. Information on acts of suicide, sabotage, or terrorism on 14 CFR 121 flights is available elsewhere.
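
For anyone reusing the data, the convention described above can be sketched as follows. The records, the field names `fatalities` and `illegal_act`, and the exposure figure are all hypothetical and made up for illustration. The sketch only shows the assumed rule: illegal acts count toward the totals but are left out of the per-million-miles rates.

```python
# Hypothetical records for one year; every number here is made up for illustration
accidents = [
    {'fatalities': 0, 'illegal_act': False},
    {'fatalities': 3, 'illegal_act': False},
    {'fatalities': 40, 'illegal_act': True},  # e.g. sabotage: in the totals, not in the rates
]
million_miles_flown = 8000.0  # hypothetical exposure for the same year

# Totals include every accident, illegal acts included
total_accidents = len(accidents)
total_fatalities = sum(a['fatalities'] for a in accidents)

# Rates exclude accidents caused by illegal acts
rate_accidents = [a for a in accidents if not a['illegal_act']]
accidents_per_million_miles = len(rate_accidents) / million_miles_flown
fatalities_per_million_miles = (
    sum(a['fatalities'] for a in rate_accidents) / million_miles_flown
)

print(total_accidents, total_fatalities)
print(accidents_per_million_miles, fatalities_per_million_miles)
```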

There are some inconsistencies in the data as to which year is the last complete year of data. In general, all data stops at 2015 because the FAA releases a complete, verified set of data every 5 years (the next big release is 2020). Some data is incomplete for 2015 and some data is incomplete for 2014.